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Abstract Monte Carlo methods represent the de facto 
standard for approximating complicated integrals in¬ 
volving multidimensional target distributions. In order 
to generate random realizations from the target distri¬ 
bution, Monte Carlo techniques use simpler proposal 
probability densities to draw candidate samples. The 
performance of any such method is strictly related to 
the specification of the proposal distribution, such that 
unfortunate choices easily wreak havoc on the result¬ 
ing estimators. In this work, we introduce a layered 
(i.e., hierarchical) procedure to generate samples em¬ 
ployed within a Monte Carlo scheme. This approach 
ensures that an appropriate equivalent proposal den¬ 
sity is always obtained automatically (thus eliminating 
the risk of a catastrophic performance), although at 
the expense of a moderate increase in the complexity. 
Furthermore, we provide a general unified importance 
sampling (IS) framework, where multiple proposal den¬ 
sities are employed and several IS schemes are intro¬ 
duced by applying the so-called deterministic mixture 
approach. Finally, given these schemes, we also propose 
a novel class of adaptive importance samplers using a 
population of proposals, where the adaptation is driven 
by independent parallel or interacting Markov Chain 
Monte Carlo (MCMC) chains. The resulting algorithms 
efficiently combine the benefits of both IS and MCMC 
methods. 
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1 Introduction 

Monte Carlo methods currently represent a maturing 
toolkit widely used throughout science and technology 
[2DJ|371[^. Importance sampling (IS) and Markov Chain 
Monte Carlo (MCMC) methods are well-known Monte 
Carlo (MC) techniques applied to compute integrals in¬ 
volving a high-dimensional target probability density 
function (pdf) 7r(x). In both cases, the choice of a suit¬ 
able proposal density g(x) is crucial for the success of 
the Monte Carlo based approximation. For this reason, 
the design of adaptive IS or MCMC schemes represents 
one of the most active research topics in this area, and 
several methods have been proposed in the literature 

[I1IIS11I1I1711S3]. 

Since both IS and MCMC have certain intrinsic ad¬ 
vantages and weaknesses, several attempts have been 
made to successfully marry the two approaches, pro¬ 
ducing hybrid techniques: IS-within-MCMC [3E1ISI1 
[32143] or MCMC-within-IS [5ll2[I2l3211Illl2l5l].To 
set the scene for such developments it is useful to recall 
briefly some of the main strengths of IS and MCMC, 
respectively. For instance, one benefit of IS is that it 
delivers a straightforward estimate of the normalizing 
constant of 7f(x ) [32117] (a.k.a. evidence or marginal 
likelihood), which is essential for several applications 
[751 112. In contrast, the estimation of the normaliz¬ 
ing constant is highly challenging using MCMC meth¬ 
ods, and several authors have investigated different ap¬ 
proaches to overcome the obstacles related to the in¬ 
stability of the resulting estimators 12112 [12125ll52. 
Furthermore, the application and the theoretical anal¬ 
ysis of an IS scheme using an adaptive proposal pdf is 
easier than the theoretical analysis of the corresponding 
adaptive MCMC scheme, which is much more delicate 

m- 
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On the other hand, an appealing feature of MCMC 
algorithms is their explorative behavior. For instance, 
the proposal function g(x|xt_i) can depend on the pre¬ 
vious state of the chain Xt_i and foster movements be¬ 
tween different regions of the target density. For this 
reason, MCMC methods are usually preferred when no 
detailed information about the target 7f(x) is available, 
especially in large dimensional spaces [H 121]. More¬ 
over, in order to amplify their explorative behavior sev¬ 
eral parallel MCMC chains can be run simultaneously 
[47l [30] . This strategy facilitates the exploration of the 
state space, although at the expense of an increase 
in the computational cost. Several schemes have been 
introduced to share information among the different 
chains [I1IMII371, which further improves the overall 
convergence. 

The main contribution of this work is the descrip¬ 
tion and analysis of a hierarchical proposal procedure 
for generating samples, which can then be employed 
within any Monte Carlo algorithm. In this hierarchical 
scheme, we consider two conditionally independent lev¬ 
els: the upper level is used to generate mean vectors 
for the proposal pdfs, which are then used in the lower 
level to draw candidate samples according to some MC 
scheme. We show that the standard Population Monte 
Carlo (PMC) method [H] can be interpreted as apply¬ 
ing implicitly this hierarchical procedure. 

The second major contribution of this work is pro¬ 
viding a general framework for multiple importance sam¬ 
pling (MIS) schemes and their iterative adaptive ver¬ 
sions. We discuss several alternative applications of the 
so-called deterministic approach [211 [Ml ISD] for sam¬ 
pling a mixture of pdfs. This general framework in¬ 
cludes different MIS schemes used within adaptive im¬ 
portance sampling (AIS) techniques already proposed 
in literature, such as the standard PMC [T2] , the adap¬ 
tive multiple importance sampling (AMIS) [ISlISl], and 
the adaptive population importance sampling (APIS) 

m- 

Finally, we combine the general MIS framework with 
the hierarchical procedure for generating samples, in¬ 
troducing a new class of AIS algorithms. More specifi¬ 
cally, one or several MCMC chains are used for driving 
an underlying MIS scheme. Each algorithm differs from 
the others in the specific Markov adaptation employed 
and the particular MIS technique applied for yielding 
the final Monte Carlo estimators. This novel class of 
algorithms efficiently combines the main strengths of 
the IS and the MCMC methods, since it maintains an 
explorative behavior (as in MCMC) and can still easily 
estimate the normalizing constant (as in IS). 

We describe in detail the simplest possible algorithm 
of this class, called random walk importance sampling. 


Moreover, we introduce two additional population-based 
variants that provide a good trade-off between per¬ 
formance and computational cost. In the first variant, 
the mean vectors are updated according to indepen¬ 
dent parallel MCMC chains. In the other one, an in¬ 
teracting adaptive strategy is applied. In both cases, 
all the adapted proposal pdfs collaborate to yield a 
single global IS estimator. One of the proposed algo¬ 
rithms, called parallel interacting Markov adaptive im- 
portanee sampling (PI-MAIS), can be interpreted as 
parallel MCMC chains cooperating to produce a single 
global estimator, since the chains exchange statistical 
information to achieve a common purpose. 

The rest of the paper is organized as follows. Sec¬ 
tion]^ is devoted to the problem statement. The hier¬ 
archical proposal procedure is then introduced in Sec¬ 
tion 1^ In Section we describe a general framework 
for importance sampling schemes using a population of 
proposal pdfs, whereas Section introduces the adap¬ 
tation procedure for the mean vectors of these pro¬ 
posal pdfs. Numerical examples are provided in Sec¬ 
tion including comparisons with several benchmark 
techniques. Different scenarios have been considered: 
a multimodal distribution, a nonlinear banana-shaped 
target, a high-dimensional example, and a localization 
problem in a wireless sensor network. Finally, Section 
Ul contains some brief final remarks. 


2 Target distribution and related integrals 

In this work, we focus on the Bayesian applications of 
IS and MCMC. However, the algorithms described may 
also be used for approximating any target distribution 
that needs to be handled by simulation methods. Let 
us denote the variable of interest as x G A C M-®”:, and 
let y G be the observed data. The posterior pdf is 
then given by 

,(x)=p(x|y) = l(jtm, (I) 

where ^(y|x) is the likelihood function, (?(x) is the prior 
pdf, and Z{y) is the model evidence or partition func¬ 
tion. In general, Z(y) is unknown, so we consider the 
corresponding unnormalized target, 

7r(x) = £(y|x)g(x). (2) 

Our goal is computing efficiently some integral measure 
w.r.t. the target pdf, 

= 4 / /(x)’^(x)c^x, (3) 

A* 
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where 

Z = f 7r(x)(ix, (4) 

Jx 

and / is any square-integrable function (w.r.t. 7f(x)) of 
xj^In this work, we address the problem of approximat¬ 
ing I and Z via Monte Carlo methods. Since drawing 
directly from 7f(x) oc 7r(x) is impossible in many appli¬ 
cations, Monte Carlo techniques use a simpler proposal 
density g(x) to generate random candidates, testing or 
weighting them according to some suitable rule. Indeed, 
throughout the paper we focus on the combined use of 
several proposal pdfs, denoted as qi,... ,qj. 

3 Hierarchical procedure for proposal 
generation 

The performance of MC methods depends on the dis¬ 
crepancy between the target, 7r(x) oc 7r(x), and the pro¬ 
posal ( 7 (x). Namely, the performance improves if (?(x) 
is more similar (i.e., closer) to 7f(x). In general, tun¬ 
ing the parameters of the proposal is a difficult task 
that requires statistical information of the target dis¬ 
tribution. In this section, we deal with this important 
issue, focusing on the mean vector of the proposal pdf. 
More specifically, we consider a proposal pdf defined by 
a mean vector fi and covariance matrix C, denoted as 
q{x\fi, C) = q{x — fi\C). We propose the following hier¬ 
archical procedure for generating a set of samples that 
will be employed afterwards within some Monte Carlo 
technique: 

1. For j = 1,..., J: 

(a) Draw a mean vector fij ^ h{(Jb). 

(b) Draw xj"*^ ^ g(x|pj, C) for m = I,..., M. 

2. Use all the generated samples, xj"*^ for j = I,..., J 
and m = I,..., M, as candidates within some Monte 
Carlo method. 

Note that h{fi) plays the role of a prior pdf over the 
mean vector of q in this approach. Hence, the pdf of 
each sample Xj can be expressed as 

q{x\C)=[ g(x|/x, C)h(M)d/x, (5) 

Jx 

i.e., the hierarchical procedure is equivalent to draw¬ 
ing directly x^"*^ ^ 9(^1 C) for all j = I,..., J and 
m = The density q is thus the equivalent 

proposal density of the whole hierarchical generating 

^ Note that, as both w(x) and Z depend on the observations 
y, the use of 7r(x|y) and Z{y) would be more precise. However, 
since the observations are fixed, in the sequel we remove the 
dependence on y to simplify the notation. 


procedure. Note also that the samples /^i,... ,/xj are 
not directly used by the Monte Carlo estimator, since 
only the samples x^™^, for j = I,..., J, m = 1,..., M, 
enter the actual estimator. Hence, the computational 
cost per iteration of this hierarchical procedure is higher 
than the cost of a standard approach. However, it leads 
to substantial computational savings in terms of im¬ 
proved convergence towards the target, and thus a re¬ 
duced number of iterations required, as shown later in 
the simulations. Furthermore, note that the generation 
of the /Xj’s in the upper level is independent of the sam¬ 
ples drawn in the lower level, thus facilitating the 
theoretical analysis of the resulting algorithms, as dis¬ 
cussed in Section [ 5 .iPI 

3.1 Optimal prior h*(/x) 

Assuming that the parametric form of (/(xj/x, C) and its 
covariance matrix C are fixed, we consider the problem 
of hnding the optimal prior h*{fj,\C) over the mean vec¬ 
tor fi. Note that, since q{x\fi,C) = q{x — fi\C), we can 
write 

q{x\C)= [ q{x-fi\C)h{fi\C)dfi. (6) 

Jx 

regardless of the choice of the prior over the mean vec¬ 
tors in the upper level. The desirable scenario is to have 
the equivalent proposal ^(xlC) coinciding exactly with 
the target 7f(x )gi,e., 

9 (x|C) = /* g(x-^|C)/i*(#(|C)ci^ = 7f(x), (7) 

Jx 

where h*{fi\C) represents the optimal prior. 


3.2 Asymptotically optimal choice of the prior h{fi) 

Since Eq. Q cannot be solved analytically in general, 
in this section we relax that condition and look for 
an equivalent proposal q which fulfills 0 asymptoti¬ 
cally as J —>■ 00 . For the sake of simplicity, let us set 

^ Note that, in the ideal case described here, each /Xj is 
also independent of the other /x’s. However, in the rest of 
this work, we also consider cases where correlation among 
the mean vectors (pi,.. ., /xj) is introduced. 

® Given a function /(x), the optimal proposal q minimiz¬ 
ing the variance of the IS estimator is 5^(x|C) oc |/(x)|7f(x). 
However, in practical applications, we are often interested in 
computing expectations w.r.t. several /’s. In this context, a 
more appropriate strategy is to minimize the variance of the 
importance weights. In this case, the minimum variance is 
attained when g(x|C) = 7f(x) m- 
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M = 1. Thus, we consider the generation of J sam¬ 
ples {xi,..., xj}, drawn using the following hierarchi¬ 
cal procedure: 

(a) Draw a mean vector fij ~ 

(b) Draw Xj ^ q{x\fij, C). 

Note that we are using J different proposal pdfs, 




to draw {xi,... ,xj}, with each Xj being drawn from 
the j-th proposal Xj ~ q{x\fj,j, C). However, if the sam¬ 
ples xi,... ,xj are used altogether regardless of their 
order, then it can interpreted that they have been drawn 
from the following mixture using the deterministic mix¬ 
ture sampling scheme (see [45l Chapter 9], [22]): 

i=i 

Note that, since fij ^ then ip{x) is a Monte Carlo 
approximation of the integral in Eq. Q, i.e., 

ijjix) > g(x|C) = [ q{x-fi\C)h{fi\C)dfi. (9) 

Furthermore, if we choose h(fJ-) = i-e., fJ,j ~ 7r(/x), 

then ipix) is also a kernel density estimator of 7r(x), 
where the q{x\fij, C) play the role of the kernel func¬ 
tions m- In general, this estimator has non-zero bias 
and variance, depending on the choice of g, C and the 
number of samples J. However, for a given value of J, 
there exists an optimal choice of C* which provides the 
minimum Mean Integrated Square Error (MISE) esti¬ 
mator m- Using the optimal covariance matrix C*, it 
can be proved 


^ 7r(x), 


( 10 ) 


i=i 


pointwise as J —>■ oo [5T]. Hence, the equivalent pro¬ 
posal density of the hierarchical approach converges 
to the target when J —)■ oo. It is possible to show 
11(7*11 —>■ 0 as J —>■ oo, so that there is no contradiction 
between Q and (101 since q{x—fj,\C*) becomes increas¬ 
ingly similar to S{x — fi), and thus ^(xlC*) —7r(x) as 
J —>■ oo. 


3.3 Practical implementation 

As explained in Section |3.2[ h{fi) = 7f(/x) is a suit¬ 
able choice from a kernel density estimation point of 
view. However, sampling directly from 7r(/x) is unfeasi¬ 
ble from a practical point of view (otherwise, we would 
not require any MC algorithm). Therefore, we propose 


applying another sampling method, such as an MCMC 
algorithm, to obtain the samples {/xi,..., fJ-j} ~ tf(/x). 
More specifically, starting from an initial /xq, we gener¬ 
ate a sequence 


jjij ^ K, j — 1,... ,J, 

where K is the kernel of the MCMC technique used. 
With the choice h{fi) = 7t{iJi), the two levels of the 
sampler play different roles: 

— The upper level attends the need for exploration of 
the state space, providing {/xi,..., /xj}. 

— The lower level is devoted to the approximation of 
local features of the the target, using {xi,..., x,/}. 

In general, the two levels require their own tuning of 
the parameters of the corresponding proposals. 


3.4 Relationship with other adaptive MC schemes 


In contrast to the hierarchical approach described pre¬ 
viously, in standard adaptive MC approaches [9l [271 [33] 
the parameter /.t„ is determined by a deterministic func¬ 
tion, 

7 : KMxz>,x(n-i) ^ 

of the previously generated samples (assuming to gen¬ 
erate M samples from each proposal). 


X,_i = [xW, 


AM) 


jl) MM), 
S- 1 ’ ■ • ■ Mj-iJ, 


namely. 


/X, = 7(X,-i). (II) 

Although 7 is a deterministic function, the sequence 
is generated according to a conditional pdf, 
Ar(/Xj|/xi,...,/Xj_i), since Xj_i is random. Unlike in 
the hierarchical scheme, in standard adaptive MC ap¬ 
proaches, the sequence typically converges to 

a fixed vector. 

In the standard PMC method m the sequence of 
mean vectors fij’s is also generated depending on the 
previous x’s but, in this case, the final distribution 
is unknown and it is not a fixed vector, in general 
(for further details see Appendix 0- Similar consid¬ 
erations also apply for Sequential Monte Carlo (SMC) 
schemes [42l [48] where the adaptation is performed 

using a combination of resampling and MCMC steps. 
Other interesting and related techniques are the Parti¬ 
cle MCMC (P-MCMC) [3] and the Sequentially Inter¬ 
acting MCMC (SI-MCMC) [5] methods. In this case, 
IS approximations of the target are used to build bet¬ 
ter proposal pdfs, employed within MCMC steps. Both 
methods are also able to provide efficient estimators 
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of Z. However, unlike in PMC, SMC, P-MCMC and 
SI-MCMC, in the proposed hierarchical approach each 
/tj is always chosen independently of Xj_i and it is 
distributed according to decided in advance by 

the user. Moreover, the means /ti,...,are not in¬ 
volved in the resulting estimators. Related observations 
are provided in Section [SJ] and Table 


4 Generalized Multiple Importance Sampling 

So far, we have introduced a hierarchical procedure to 
generate candidates for an MC technique, adapting the 
mean vectors of a set of proposal densities. In this sec¬ 
tion, we provide a general framework for multiple im¬ 
portance sampling (MIS) techniques using a population 
of proposal densities, which embeds various alternative 
schemes proposed in the literature [52]. First, we con¬ 
sider several alternatives of static MIS, and then we 
focus on the corresponding adaptive MIS samplers. 


4.1 Generalized Static Multiple Importance Sampling 

As we have already highlighted, finding a good proposal 
pdf, (/(x), is critical and is in general very challenging 
[IS] . An alternative strategy consists in using a popula¬ 
tion of proposal pdfs. This approach is often known in 
the literature as multiple importance sampling (MIS) 
Uni|46l|^|22| . Consider a set of J proposal pdfs, 

9 i(x),...,gj(x), 


In both cases, the consistency of the estimators is en¬ 
sured [22] . The main advantage of the DM-MIS weights 
is that they yield more efficient estimators than using 
the standard importance weights [ismidiiisH]- How¬ 
ever, the DM-MIS estimator is computationally more 
expensive, as it requires JM total evaluations for each 
proposal instead of just M, for computing all the weights. 
The number of evaluations of the target 7r(x) is the 
same regardless of whether the weights are calculated 


according to Eq. (12) or (131, so this increase in compu¬ 


tational cost may not be relevant in many applications. 
However, in some other cases this additional computa¬ 
tional load may be excessive (especially for large values 
of J) and alternative efficient solutions are desirable. 
For instance, the use of partial mixtures has been pro¬ 
posed in [5T] : 

(c) PortiaZ DM-MIS (P-DM-MIS) [2T]: divide the J pro¬ 
posals in L = p disjoint groups forming L mix¬ 
tures with P components. Let us denote the set of 
P indices corresponding to the £-th mixture (£ = 
as Si = {ki^i,... ,kip} (i.e., I^S^I = P), 
where each ki^p S {1,..., J}. Thus, we have 


U 52 U...U5l = {!,..., J}, 


(14) 


with Sr n Si = 0, for all £ = 1,L, and r ^ £. In 
this case, the importance weights are defined as 


= 1 


r(x 0 


1 z. ' 


(15) 


with j S Si, £ = 1,..., L, and m = 1,..., M. 


with heavier tails than the target tt, and let us assume 
that M samples are drawn from each of them, i.e., 

--gj(x), j = l,...,J, m = l,...,M. 

In this scenario, the weights associated to the samples 
can be obtained following at least one of these two 
strategies: 

(a) Standard MIS (S-MIS): 



for j = 1,..., J and m = 1,..., M, 

(b) Deterministic mixture MIS (DM-MIS) [^150] : 



for j = 1,..., J and m = 1,..., M, and where V’(x) = 
J J2j=i (x) is the mixture pdf, composed of all the 
proposal pdfs. This approach is based on the con¬ 
siderations provided in Appendix [B] 


All the previous cases can be captured by a generic 
mixture-proposal ^j(x), under which the MIS weights 
can be defined as 


(m) 




(16) 


with m = 1,..., M, where = qj(xj) in Eq. 

'^i(Xj^) = 7 ELi 9fc(xj'"^) in Eq. and 


)> with j G Si, 


(17) 




in Eq. (151. In any case, the weights are always normal¬ 
ized as 


-(m) 

Pj = 


Table 


w 


(m) 


M (r) ■ 


i=l 2Zr=l 


(18) 


shows these three choices of ^j(xj™^), whereas 


Table 5 summarizes a generalized static MIS procedure. 
















6 


L. Martino* et al. 


Table 1 Three possible functions ^j(x) for MIS. 


MIS approach 

Function <Pj{x), 

L 

P 

{3 = 1,■■■N) 

LP 

= J 

Standard MIS 

Qj (x) 

J 

1 

DM-MIS 

V’(x) = 7 E)'=i'?j(x) 

1 

J 

Partial DM-MIS 

^Efees, Qk{^) 

L 

P 


Table 2 Generalized static MIS scheme. 


1. Generation: Draw M samples from each qj, i.e., 




' 91 (x), 


for j = 1,. .., J, and with m = 1,..., M. 

2. Weighting: Assign to each sample the weight 


( i'>^) \ 

! X TTlX^ ’ ) 

(■m) ^ 1 ' 

W^- = -7-^ 




(19) 


where is a mixture of g^ ’s, as shown in Table 

3. Normalization: Set 


p'"*) = 


(m) 


Eti E" 1 ' 

4. Output: Return all the pairs for j = 

1,... ,J and m = 1,..., M. 


Note that the IS estimator / of a specific moment 
of 7f, i.e., the integral I given in Eq. ([^, and the ap¬ 
proximation Z of the normalizing constant in Eq. (|^, 
can now be approximated as 


,7 M 

j — 1 m—1 
. J M 

j—1 m—1 


( 20 ) 


Then, the particle approximation of the measure of tt 
is given by 


.(JM) 


(X) = 


1 

JMZ 


J M 

EE 

j—1 m—1 


(n 

w) 


^S(x- 


( 21 ) 


In Section [4.2[ we describe a framework where a partial 
grouping of the proposal pdfs arises naturally from the 
sampler’s definition. 


parameters of the proposal iteratively using the infor¬ 
mation of the past samples da [151138]. In this adaptive 
scenario, we have a set of proposal pdfs {g„^t(x)}, with 
n = and t = where the subscript 

t indicates the iteration index, T is the total number 
of adaptation steps, and J = NT is the total num¬ 
ber of proposal pdfs. In the following, we present a 
unified framework, called generalized adaptive multi¬ 
ple importance sampling (GAMIS), which includes sev¬ 
eral methodologies proposed independently in the liter¬ 
ature, as particular cases. In GAMIS, each proposal pdf 
in the population {qn,t} is updated at every iteration 
t = 1,..., r, forming the sequence 

<?n.i(x),g„,2(x),...,g„,T(x), 

for the n-th. proposal (see Figure [^. At the f-th itera¬ 
tion, the adaptation procedure takes into account sta¬ 
tistical information about the target distribution gath¬ 
ered in the previous iterations, 1,... — 1, using one 

of the many procedures that have been proposed in the 
literature III1II1II3I3BI- Furthermore, at the t-th iter¬ 
ation, M samples are drawn from each proposal 

xi™^ ~ gn.t(x), with TO= 1,...,M, 

n = 1,..., Af and t = 1,..., T. An importance weight 
is then assigned to each sample . Several 
strategies can be applied to build considering the 
different MIS approaches, as discussed in the previous 
section. Figure [^provides a graphical representation of 
this scenario, by showing both the spatial and temporal 
evolution of the J = NT proposal pdfs. 



Fig. 1 Graphical representation of the J = NT proposal pdfs 
used in the generalized adaptive multiple IS scheme, spread 
through the state space X (n = 1,... ,N) and adapted over 
time (7 = 1,...,T). Three different mixtures are displayed: 
-b(x) involving all the proposals, <)>t(x) involving only the pro¬ 
posals at the t-th iteration, and 5n(x) considering the tempo¬ 
ral evolution of the n-th proposal pdf. 


4.2 Generalized Adaptive Multiple Importance 
Sampling 

In order to decrease the mismatch between the proposal 
and the target, several Monte Garlo methods adapt the 


In an AIS algorithm, one weight 


(m) 

= 


<Z>n,(x«) 


( 22 ) 
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is associated to each sample In a standard MIS 

approach, the function employed in the denominator is 


^n.t(x) = g„,t(x). 


(23) 


In the complete DM-MIS case, the function t is 


N 


= V’(x) = 


(24) 


k—1 r=l 


This case corresponds to the external blue rectangle 
in Fig. [2 Two natural alternatives of partial DM-MIS 
schemes appear in this scenario. The first one uses the 
following partial mixture 


<?n.t(x) = ^„(x) = - ^ <7n.r(x), 


(25) 


with n = I,..., TV, in the denominator of the IS weight. 
Namely, we consider the temporal evolution of the n-th 
single proposal qn^t- Hence, we have L = N mixtures, 
each one formed by P = T components (horizontal red 
rectangle in Fig.[^. The other possibility is considering 
the mixture of all the qn,ts at the f-th iteration, i.e.. 


1 ^ 

<^n.t(x) = MA = W 


(26) 


for f = I,..., r, so that we have L = T mixtures, each 
one formed by P = iV components (vertical green rect¬ 
angle in Fig. [^. The function in Eq. ( |^ is used 
in the standard PMC scheme [T2]; Eq. (251 with N = \ 


has been considered in adaptive multiple importance 


sampling (AMIS) [TS]. Eq. (261 has been applied in the 
adaptive population importance sampling (APIS) algo¬ 
rithm |38j . whereas in other techniques, such as Mix¬ 
ture PMC mi na ng, a similar strategy is employed 
but using a standard sampling of the mixture <ptA)- 
Table summarizes all the possible cases discussed 
above. The last row corresponds to a generic grouping 
strategy of the proposal pdfs qn,t- As previously de¬ 
scribed, we can also divide the J = NT proposals into 
L = disjoint groups forming L mixtures with P 
components. We denote the set of P pairs of indices cor¬ 
responding to the £-th mixture {£= 1,L) as Si = 
■ ■ ■ ^Ae,PNi,p)}^ where kg^p S {I,...,A^}, 
rg,p S {I,... ,r} (i.e., |5^| = P, with each element be¬ 
ing a pair of indices), and SrCiSg = 0 for all £ = 1 ,..., L, 
and r £. In this scenario, we have 


<?n.t(x) = ;^ I] 

{k,r)GSi 


qk,r{A^ with {n,t)GSi. (27) 


Note that, using tpA) ^n(x), the computational 
cost per iteration increases as the total number of it¬ 
erations T grows. Indeed, at the t-th iteration all the 
previous proposals qn,ii ■ ■ ■, qn,t-i (for all n) must be 
evaluated at all the new samples . Hence, algo¬ 
rithms based on these proposals quickly become unfea¬ 
sible as the number of iterations grows. On the other 
hand, using (ptA) the computational cost per iteration 
is controlled by N, remaining invariant regardless of the 
number of adaptive steps performed. 

Observe also that a suitable AIS scheme builds iter¬ 
atively a global IS estimator which uses the normalized 
weights 


(m) 

W„,t' 


(m) ’ 

Z^r=l Z^n=l Z^m=l 


(28) 


for n = 1,..., Ai, m = 1,..., M, and t = I,..., T. 

TablelHshows an iterative version of GAMIS. We re¬ 
mark that, at the t-th iteration, the weights of the sam¬ 
ples previously generated need to be recalculated, as 
shown in step 2(c-3) of Table The choices <Pn,tA) = 
qn,t{A O'" 'k’n,t{A = A A) allow avoiding completely 
this re-computation step of the weights. For simplicity, 
in Table we have provided the output of the algo¬ 
rithms as weighted samples, i.e., all the pairs {x^”j\ 
However, the output can be equivalently expressed as 
an estimator of a specific moment of the target. In this 
case, the final IS estimators It and Zt are 


T N M 


= E E E 


T—l n—l m—1 

^ T N M 

^ NMT E E E ’ 

r—1 n—l m—1 


(m) 


(29) 


where pAr = „ ■ Moreover, the final particle ap- 

’ N AIT At 

proximation is 


r(™)(x) = 


1 

NMTZt 


T N M 




r — 1 n—l m—1 


(30) 


The estimators in Eq. (291 can be expressed recursively, 
thus providing an estimate at each iteration t, as stated 
before. Starting with Hq = 0, Iq = 0, and setting St = 
En=i EEi and Ht = Ht-i +St, we have 


A = E 


Ht 


Ht- 


N M 

n—l m—1 

St 


t-1 


Ht-i + S, 


■It-i + 


Ht-i + S, 


A, 


(31) 
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Table 3 Summary of possible MIS strategies in an adaptive framework. 


MIS approach 

Function 

J 

L P 

Corresponding Algorithm 

LP = J 

Standard MIS 

qn,t{^) 

AT 

NT 

1 

PMC 112J 

DM-MIS 

V'(x)= ^(^Ej.E.EEl'InAx) 

1 

NT 

suggested in 1211 

Partial DM-MIS 

?r>(x) = y^ELl In, Ax) 

A 

T 

AMIS OAI. with A = 1 

Partial DM-MIS 

<^Ax) = f EC-l Qn.Ax) 

T 

A 

APIS |M] and fm ITfl ITS] 

Partial DM-MIS 

generic <Pn,t(:x-) in Eq. \27\ 

L 

P 

suggested in |21| 


where i* = 




is the partial IS 
estimator using only the samples drawn at the t-th iter¬ 
ation. Therefore, It can be seen as a convex combination 
of the two IS estimators It-i and At (for further expla¬ 


nations see Eqs. (46)-(47) in Appendix B.3|. Finally, 
note that 




(32) 


A brief discussion about the consistency of It and Zt is 
provided in Appendix [Aj 


5 Markov adaptation for GAMIS 

In this section, we design efficient adaptive importance 
sampling (AIS) techniques by combining the main ideas 
discussed in the two previous sections. More specifically, 
we apply the hierarchical MC approach to adapt the 
proposal pdfs within a GAMIS scheme. Therefore, a 
Markov GAMIS technique, or simply Markov Adaptive 
Importance Sampling (MAIS) algorithm, consists of the 
following two layers: 

1. Upper level (Adaptation): Given the set of mean vec¬ 
tors, 

Vt-l = . . . , pAr,t-l}j 

obtain the new set Vt = • ■ • > accord¬ 

ing to MCMC transitions with tt as invariant den¬ 
sity. More specifically, a kernel A'(pi:Ar,t|pi:Ar,t-i) 
leaving invariant the distribution RLi 7r(p„) is ap¬ 
plied. 

2. Lower level (MIS estimator): Given the population 
of proposals. 


Table 4 GAMIS scheme: iterative version. 


1. Initialization: Set t = 1, Hq = 0 and choose N initial 

proposal pdfs q„^o(x). 

2. For t = 1,... ,T: ’ 

(a) Adaptation: update the proposal pdfs 

{gn.t-ilOLi providing using a preestab¬ 

lished procedure (e.g., see [niITTlITSlESI for some 
specific approaches). 

(b) Generation: Draw M samples from each qn,t, i.e., 

~ with n = and ra = 

1,.’. .,M. 

(c) Weighting: 

(c-1) Update the function <Pn,t{'Z) given the current 
population ..., 

(c-2) Assign the weights to the new samples , 


w 


(m) 

n,t 





(m), ’ 
n,t ) 


(33) 


for n = 1,. .., A, and m = 1,..., M. 

(c-3) Re-weight the previous samples for t = 

1,... , t — 1 as 


w 


(m) _ 

n,T — 


^n,t(xi™^) 


(34) 


with T = — 1, n = 1,...,A, and m = 

(d) Normalization: Set St = Em=i E)Li 

Ht = Ht-i + St , and re-normalize all the weights, 


-(m) 

Pn,T 


= o'*"' 

Pn,T — l 


Ht-i 

Ht 


(35) 


for T = 1,... n = 1,..., N, and m = 1,. .., M. 

(e) Output: Return all the pairs {xAU)), for r = 

1,..., t, n = 1,..., A, and m = 1,..., M. 


5.1 Theoretical support: adaptation and consistency 


9i,t(x|/xi,t, Cl),..., gAr,t(x|/xjv,t, Cat), 

choose a function <Pn^t(x.) for the computation of the 
weights in Eq. (22), and perform a MIS approxima¬ 


tion of the target as described in Section 4.2 


The motivation behind the MCMC adaptation has been 
described in Section 3.2 and 3.3 the functions qn,t, lo¬ 
cated at the UnAs, jointly provide a kernel estimate of 
the target tt. 

Furthermore, we recall that the generation of the 
means, pbn,t, is completely independent hour the samples 
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drawn in the lower level. This is a key point from 
a theoretical and practical point of view. Indeed, the 
generic MATS algorithm can be divided in two steps: (a) 
first generate all the means for n = 1,..., A^, 

(b) then perform the MIS estimation considering all 
the proposals C„), Vn and Vt. Namely, any 

MAIS technique can be converted into a generalized 
static MIS scheme (see Section 4.1). As a consequence, 
the unique conditions required for ensuring the consis¬ 
tency of the corresponding estimators are [221117!: 


All the proposal pdfs, qn,t, must have heavier tails 
than the target tt. 

A suitable function for the denominator of 

the importance weights must be chosen. Namely, the 
use of <l>n t (x) provides consistent estimators m. 
like the functions described in Section 


4.2 


Moreover, the independence of the upper level from the 
lower level of the hierarchical approach, helps the par¬ 
allelization of the algorithms as we discuss later. 

Table [^compares different AIS schemes. In the stan¬ 
dard AIS method |5], the sequence of {/.*«,t} converges 
to a unknown fixed vector as t > oo. In the standard 
PMC algorithm [12], the limiting distribution of 
is unknown. Furthermore, in both cases, standard AIS 
and PMC, the adaptation depends on the previously 
generated samples x’s. In MAIS techniques, the use of 
an ergodic chain (with invariant pdf tt) for generating 
the n-th mean vector ensures that its asymptotic 
density is 7r(/x). 


Table 7 Random Walk Importance Sampling (RWIS) algo¬ 
rithm. 


1. Initialization: start with t = 1, Hq = 0, choose the 
values M and T, the initial location parameter fio, the 
scale parameters C and A. 

2. For t = 1,... ,r: 

(a) MH step: 

(a-1) Draw/x'~ 99(p|/it_i,yl). 

(a-2) Set fit = /x' with probability 


a = mm 


1 7r(/x')y(/Xf|Mb2i), 


otherwise set /xt = Mt-i (with probability 1 — 
a). 

(b) IS steps: 

(b-1) Draw qt{^\^t^ C^) for m = 1,..., M. 

(b-2) Weight the samples as 


(m) 

W2 ^ = 


( (^) \ 

gt(x(™^|/xt,C„) 


(b-3) Set St = iit = Rt-i +5t, and 

normalize the weights 




-(m) -(tn) ^t — 1 


(c) Output: Return all the pairs for m = 

1,... ,M and r = 1,..., t. 


Table 5 Adaptation of the mean vectors {/Xn,t} using differ¬ 
ent AIS techniques. 


Features 

Stand. AIS 

PMC 

MAIS 

limiting 
distribution of 
{/Xn.t} for t —> oo 

(unknown) 

fixed 

vector 

unknown 
(if/when 
exists) 

7f(/x) 

dependence of 
the adaptation 
w.r.t. the x’s 

yes 

yes 

no 


5.2 The new class of algorithms 

Markov GAMIS framework can lead to many differ¬ 
ent algorithms, depending on the MCMC strategy used 
to update the mean vectors and the specihc choice of 
the function 'Pn,t- Table provides several examples 
of novel techniques determined by the value of the 
choice of , and the type of MCMC adaptation. Some 
of them are variants of well-known techniques like PMC 


[T5] and AMIS [12], where the Markov adaptation pro¬ 
cedure is employed. Others, such as the Random Walk 
Importance Sampling (RWIS), the Parallel Interacting 
Markov Adaptive Importance Sampling (PI-MAIS) and 
Doubly Interacting Markov Adaptive Importance Sam¬ 
pling (P-MAIS), are described below in detail. For these 
completely novel algorithms we have set ^n,i(x) = ^i>t(x), 
so that the computational cost is directly controlled by 
N and the re-weighting step 2(c-3) in Table is not 
required. 

RWIS is the simplest possible Markov GAMIS algo¬ 
rithm. Specifically, for the MCMC adaptation we con¬ 
sider a standard MH technique, setting N = 1 and 
choosing ^„,t(x) = ^((x) = ( 7 „,t(x) (since N = 1, 
the two cases coincide). Table shows the RWIS al¬ 
gorithm, which is a special case of the more general 
scheme described in Table [8] when N = 1. Note that 
we have a proposal pdf used for the MH adaptation, 
(/j(/x|/xt_i. A), which is different from the proposal pdf 
used for the IS estimation, q{x\pLt, C). 
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Table 6 Example of possible Markov GAMIS algorithms. 



Parallel adaptation 

Interacting adaptation 

Function ^n,t(x} 

N = 1 

N > 1 

N > 1 

9n,t(x) 

RWIS 

(see Table jTjl 

Markov PMC (related to 1121 1 

5n(x) = A <itt,t(x:) 

Markov AMIS 
(related to |15|~) 

N parallel 

Markov AMIS (rel. to [IS]) 

Population-based 
Markov AMIS (rel. to |15|1 

M^) = 

RWIS 

(see Table [7jl 

PI-MAIS 
(see Section 5.3 1 

C-MAIS 
(see Section 5.3 1 

^(x) = ^ELiELi9n,t(x) 

Markov AMIS 
(related to 11511 

Full Markov GAMIS 

generic ^n,t{x) 

Partial Markov GAMIS 


5.3 Population-based algorithms 

The RWIS technique can be easily extended by using a 
population of N proposal pdfs. In this case, we choose 


1 


N 


<l>„,t(x) = (^t(x) = gn.t(x), 

n—1 

so that the computational cost of evaluating the mix¬ 
ture depends only on N, regardless of 

the number t of iterations. Moreover, step 2(c-3) in Ta¬ 
ble]^ is not required in this case. Table describes the 
corresponding algorithm without specifying the MCMC 
approach used for generating the population of means, 
'Pt = {Mi.i, given'Pj-i. 

Two possible adaptation procedures via MCMC are 
discussed below. In the first one, we consider N inde¬ 
pendent parallel chains for updating the N mean vec¬ 
tors. We refer to this method as Parallel Interacting 
Markov Adaptive Importance Sampling (PI-MAIS). Al¬ 
though PI-MAIS is parallelizable, in the iterative ver¬ 
sion of Table the N independent processes cooperate 
together in Eq. ( [M| ) to provide unique global IS esti¬ 
mate. In the second adaptation scheme, we introduce 
the interaction also in the upper level. Hence, we refer 
to this method as Doubly Interacting Markov Adaptive 
Importance Sampling (I^-MAIS). In both cases, the cor¬ 
responding technique provides an IS approximation of 
the target or, equivalently, the estimators It and Zt in 


Eq. (29), using NMT samples. 


5.3.1 MCMC adaptation for PI-MAIS 


The simplest option is applying one iteration of N par¬ 
allel MCMC chains, one for each returning iJ,n,ti 

for n = 1,... ,N. For instance, given N parallel MH 
transitions, each one employing (possibly) a different 
proposal pdf ipn with covariance matrix A„, we have: 


Table 8 Population-Based MAIS algorithms. 


1. Initialization: Set t = 1, /q = 0 and Hq = 0. Choose 
the initial population 

Po = {mi.o, MiV.oli 

and N covariance matrices C„ (n = 1,..., N). Choose 
also the parametric form of the N normalized propos¬ 
als qi^t with parameters and C„. Let T be the 

total number of iterations. 

2. For t = 1,... ,r: 

(a) Update of the location parameters: Perform one 
transition of one or more MCMC techniques over 
the current population, 

Pt-i = {pi.t-i, 
obtaining a new population, 

Pt = {pi.t, ■■■, PN,t}- 


(b) IS steps: 

(b-1) Draw ~ qn,t(x|Mn.ti Cn) for m = 

1,... , M and n = 1,..., N. 

(b-2) Compute the importance weights, 


(m) 

"'n.t = 


) 


(36) 


with n = 1,..., N, and m = 1,..., M. 


(b-3) Set St = E^=i E:^=i Pt = Ht-i + St, 

and normalize the weights 

(m) 


Pn,t 


n,t 


y'A't (D 
A^t=1 Z-/r=l 


_(m) Ht-1 

= 

nt 

(c) Outputs: Return all the pairs for 

m = 1,..., M and r = 1,..., t. 


For n = 1,... ,N\ 


1. Draw/t'~ A„). 
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2. Set fin,t = with probability 


a = mm 


1 , 


)^n (/t-n,* —1 IM 1 ^n) 

l)V^n 1 5 -^n) _ 


otherwise set = Mn.t-i (with probability 1 — a). 

Figure [^a) illustrates this scenario. Each mean vec¬ 
tor iin,t is updated independently from the rest. There¬ 
fore, in PI-MAIS, the interaction among the different 
processes occurs only in the underlying IS layer of the 
hierarchical structure: the importance weights in Eq. 
are built using the partial DM-MIS strategy with 


J2n=i C„). Considerations about 


the parallelization of PI-MAIS are given in Section 5.5 


5.3.2 MCMC adaptation for P-MAIS 

Let us consider an extended state space and an 

extended target pdf 


N 




(37) 


where each marginal 7r(/.t„), for i = 1, ..., Ai, coincides 
with the target in Eq. ([^. In this section, we describe 
three interacting adaptation procedures for the mean 


vectors, which consider the generalized pdf in Eq. (371 


as invariant density. They are represented graphically 
in Eigs. [^b), (c) and (d). 


MH in the extended spaee 


xAT 


The simplest possibility is applying directly a block- 
MCMC technique, transitioning from the matrix 

Pt-i = [Mi.t-i, • ■ ■, AtAf.t-i], 

to the matrix Pj = [/xi ..., Let us consider an 

MH method and a proposal pdf (/3(Pt|Pt_i) : 
t^d^xn ^ Eor instance, one can consider a proposal of the 
type 

'^(Ml.t) • ■ ■ ) • ■ • iMAf.t-l) 

N 

— V^n(Mn,£|Mn,£—Ij Ayj). 

n—1 

Thus, one transition is formed by the following steps: 


At each iteration, N new samples are drawn (as 
in PI-MAIS) and therefore N new evaluations of tt are 
required (i.e., one evaluation of tt^). When a new P' is 
accepted, all the components of Pj differ from Pt_i, un¬ 
like in the strategy described later. However, the proba¬ 
bility of accepting a new population becomes very small 
for large values of N. 

Sample Metropolis-Hastings (SMH) algorithm 


SMH is a population-based MCMC technique, suitable 
for our purposes [301 Chapter 5]. At each iteration t, 
given the previous set 

Pt-l = {H'1.£-1) 


a new possible parameter /xo,£-i, drawn from an in¬ 
dependent proposal ifiifx), is tested to be interchanged 
with another parameter in Vt-i = MAf,£-i}- 

The underlying idea of SMH is to replace one “bad” 
sample in the population Vt-i with a potentially “bet¬ 
ter” one, according to a certain suitable probability a. 
The algorithm is designed so that, after a burn-in pe¬ 
riod, the elements in Vt are distributed according to 
7 rg(/xi,... ,/XAr). One iteration of SMH consists of the 
following steps: 

1. Draw a candidate pio.t-i ~ 7 ’(m)- 

2. Choose a “bad” sample, fik,t-i with fcs {I,...,Ai}, 

from the population according to a probability pro¬ 
portional to ; which corresponds to the in¬ 

verse of the standard IS weights. 

3. Accept the new population, Vt = , VN,t} 

with = pin,t-i for all n ^ fc and fik,t = Vo,t-i, 
with probability 


0.{Vt-l, tiQ.t-l) 


E N y(Atn,t-l) 
n=l 

y(Mct-i) _ ^nin y(Mi,t-i) 
M/i—Q 7r(/xi t-l) Q<i<N 


Otherwise, set Vt = Vt-i. 


Unlike in the previous strategy, the difference between 
Vt-i and Vt is at most one sample. Observe that a de¬ 
pends on Vt-i and the candidate /xo,£-i- However, at 
each iteration, only one new evaluation of tt (and gS) is 
needed at /i.o,£-i, since the rest of the weights have al¬ 
ready been computed in the previous steps (except for 
the initial iteration). 


1. Draw P' ~ (p(P|P 4 _i), where P' = [/x'^,... ,/x)y]. 

2. Set Pt = P' with probability 


a = min 


■ ^g(P')^(P,_l|P') - 

. Wg{Pt-lMP'\Pt-l)_ ’ 


otherwise set Pt = P£-i (with probability 1 — a). 


MH within Gibbs 

Another simple alternative, following an “MH within 
Gibbs” approach for sampling from Ttg, is to update 
sequentially each using one MH step in 

Hence, setting /xo,£ = gtN.t-i, we have: 
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(a) For PI-MAIS (b) For I^-MAIS (c) For P-MAIS 



(d) For P-MAIS 


Fig. 2 Different possible adaptation procedures for Population-based MAIS schemes, (a) One transition of N indepen¬ 
dent parallel MH chains S for Pl-MAIS. (b) One transition of an MH method working in the extended 

space ...,S (c) One transition of SMH 1301 Chapter 5], considering the population of mean vectors 

Vt = (d) N sequential transitions of (possibly) different MH kernels starting from /xo,t = 


For n = 1,... ,N: 

1. Draw fi' from a proposal pdf A„). 

2. Set fin,t = with probability 

. 7r(/x')v5„(/x„-i,t|At', A„) 

a = mm 1, —-r- — -— , 

otherwise set (in,t = Mn-i.t- 

This scenario is illustrated in Fig. Sd). In this case, 
after Titerations of the I^-MAIS scheme, we generate 
a unique MH chain with NT total states, divided in 
T parts of N states. At each iteration of the P-MAIS 
scheme, each block of N states is employed as mean 
vector of the N proposal pdfs used in the lower level. 


r.v.’s for performing the acceptance tests (and addi¬ 
tional r.v.’s for choosing a “bad” candidate in SMH). 
Specifically, we need: V = NT uniform r.v.’s in PI- 
MAIS and P-MAIS with MH-within-Gibbs, V = T uni¬ 
form r.v.’s for P-MAIS with MH in the extended space, 
andF = 2r, T uniform r.v. and T multinomial r.v., for 
P-MAIS with SMH. However, in practical applications, 
the main computational effort is usually required for 
the target evaluation. The computing time required in 
the multinomial sampling within SMH increases with 
N. Finally, we recall that we have used a determin¬ 
istic mixture weighting scheme with j(x) = 
which requires MN'^T evaluations of the proposal pdfs, 
g„_t(x), for n = 1,..., A and t = 1,..., T. 


5.4 Computational cost: comparison between PI-MAIS 
and I^-MAIS 

In all cases, the total number of samples involved in the 
final estimation is NMT. The total number of evalu¬ 
ations of the target, E, is larger due to the MCMC 
implementation, i.e., E > NMT. More precisely, the 
total number of evaluations of the target is: 

- E = MNT + NT, for PI-MAIS, 

- E = MNT + NT, for I^-MAIS with MH in the 
extended space 

- E = MNT + T, for P-MAIS with SMH, 

- E = MNT + AT, for P-MAIS with the MH-within- 
Gibbs approach. 

Note that we have taken into account that several eval¬ 
uations of the target have been computed in the previ¬ 
ous iterations. Moreover, the application of the MCMC 
techniques requires generation of V additional uniform 


5.5 Non-iterative and parallel implementations 


As remarked in Section |5.I[ the choice of the means 
is completely independent from the estimation 
steps. Thus, all the means can be selected in advance 
(also in parallel if the strategy in Section 5.3.1 is used), 
and the MIS estimation steps can then be performed as 
in a completely static framework (i.e., as described in 
Section 4.1). This consideration is valid for any choice 
of ^„,i(x). 


Let us consider now the choice of ^n,i’s as tempo¬ 
ral mixtures, i.e., <Pn,t = or ^n.t(x) = 

Moreover, let us consider the use of A par¬ 
allel MCMC chains for adapting the means, i.e., one 
independent chain for each parameter with n = 

1,..., A. In this case, the corresponding algorithm is 
completely parallelizable. Indeed, it can be decomposed 
into A parallel MAIS techniques, each one producing 
the partial estimators In,T and Zn,T, after T iterations. 
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The global estimators are then given by 

N 


iT = j: 


n=l Ei=l 
1 ^ 




n=l 


(38) 


Furthermore, different strategies for sharing informa¬ 
tion among the parallel chains can also be applied m 
Eg EH Ea [351143, or for reducing the total number of 
evaluations of the target [29] (the scheme in [29] can be 
applied if a unique independent proposal is employed, 
i.e., for all n). 


6 Numerical simulations 

In this section, we test the performance of the proposed 
scheme comparing them with other benchmark tech¬ 
niques. First of all, we tackle two challenging issues for 
adaptive Monte Carlo methods: multimodality in Sec¬ 
tion |6.1| and nonlinearity in Section |6.2[ Furthermore, 
in Section |6.4| we consider an application of position¬ 
ing and tuning model parameters in a wireless sensor 
network duniin]. 


6.1 Multimodal target distribution 

In this section, we test the novel proposed algorithms 
in a multimodal scenario, comparing with several other 
methods. Specifically, we consider a bivariate multi¬ 
modal target pdf, which is itself a mixture of 5 Gaus- 
sians, i.e., 

1 ^ 

7r(x) = - ^A/'(x;i/i,Sj), x e (39) 

with means vi = [- 10 ,- 10 ]^, 1^2 = [0,16]^, 1/3 = 
[13,8]^, 1^4 = [—9,7]^, i /5 = [14,-14]^, and covariance 
matrices Si = [2, 0.6; 0.6, 1], S 2 = [2, —0.4;—0.4, 2], 
S3 = [2, 0.8; 0.8, 2], S4 = [3, 0; 0, 0.5] and S5 = 
[2, —0.1;—0.1, 2]. The main challenge in this exam¬ 
ple is the ability in discovering the 5 different modes 
of 7 r(x) (X 7 r(x). Since we know the moments of 7 r(x), 
we can easily assess the performance of the different 
techniques. 

Given a random variable (r.v.) X ~ ft(x), we con¬ 
sider the problem of approximating via Monte Carlo the 
expected value i?[X] = [1.6,1.4]^ and the normalizing 
constant Z = 1. Note that an adequate approximation 
of Z requires the ability of learning about all the 5 
modes. We compare the performances of different sam¬ 
pling algorithms in terms of Mean Square Error (MSE): 


(a) the AMIS technique [H], (b) three different PMC 
scheme^ two of them proposed in mm and one 
PMC using a partial DM-MIS scheme with = 

(c) N parallel independent MCMC chains and 
(d) the proposed PI-MAIS method. Moreover, we test 
two static MIS approaches, the standard MIS and a par¬ 
tial DM-MIS schemes with <?„_t(x) = (j)t{x), computing 
iteratively the final estimator. 

For a fair comparison, all the mentioned algorithms 
have been implemented in such a way that the num¬ 
ber of total evaluations of the target is A = 2 • 10®. All 
the involved proposal densities are Gaussian pdfs. More 
specifically, in PI-MAIS, we use the following parame¬ 
ters: N = 100, M e {1,19, 99}, T G {20,100,1000} in 
order to f u lfill E = MNT + NT ={M + l)NT = 2-10® 
(see Section 5.4). The proposal densities of the upper 
level of the hierarchical approach, (^„(x|/.t„^t, A„), are 
Gaussian pdfs with covariance matrices A„ = A^l 2 
and A G {5,10,70}. The proposal densities used in 
the lower importance sampling level, ( 3 '„_((x|/x„_t, C„) 
are Gaussian pdfs with covariance matrices C„ = cr^ I 2 
and cr G {0.5,1,2,5,10,20,70}. We also try different 
non-isotropic diagonal covariance matrices in both lev¬ 
els, i.e, A„ = diag(A2_i, A^ 2 ): where Aij ~ Z^([l,10]), 
and C„ = diag(cr^ 4 , 2)1 where (t„j ~ Z//([l,10]) for 

j G {1,2} and n = 1,... N. We test all these techniques 
using two different initializations: first, we choose delib¬ 
erately a “bad” initialization of the initial mean vectors, 
denoted as Ini, in the sense that the initialization re¬ 
gion does not contain the modes of tt. Thus, we can test 
the robustness of the algorithms and their ability to im¬ 
prove the corresponding static approaches. Specifically, 
the initial mean vectors are selected uniformly within 
the following square 


M„,o^Z7([-4,4] X [-4,4]), 

for n = 1,..., TV. Different examples of this configura¬ 
tion are shown in Fig. [^ with squares. Secondly, we also 
consider a better initialization, denoted as In2, where 
the initialization region contains all the modes. Specif¬ 
ically, the initial mean vectors are selected uniformly 
within the following square 

/.t„.o~Z^([-20,20] X [-20,20]), 

for n = 1,..., TV. All the results are averaged over 2 • 10® 
independent experiments. Tables [^ and show the 
Mean Square Error (MSE) in the estimation of the first 
component of E[X], with the initialization Ini and In2 
respectively. Table [m provides the MSE in the estima¬ 
tion of Z with Ini. The best results in each column 

^ The standard PMC method m is described in Section 

El 
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are highlighted in bold-face. In AMIS [15], the mean 
vector and the covariance matrix of a single proposal 
(i.e., = 1 ) are adapted, using = Ci(x) in 

the computation of the IS weights. Hence, in AMIS, we 
have tested different values of samples per iterations 
M G {500,10^, 2 • 10^, 5 • 10^, lO^} andT= §. For the 
sake of simplicity, we directly show the worst and best 
results among the several simulations made with differ¬ 
ent parameters. PI-MAIS outperforms the other algo¬ 
rithms virtually for all the choices of the parameters, 
with both initializations. In general, a greater value of 
T is needed since the proposal pdfs are initially bad 
localized. Moreover, PI-MAIS always improves the per¬ 
formance of the static approaches. These two considera¬ 
tion show the benefit of the Markov adaptation. Hence, 
PI-MAIS presents more robustness with respect to the 
initial values and the choice of the covariance matrices. 
Figure l^a) providing a summary of the results in Ta¬ 
ble [^showing the log(MSE) as function of the log((T), 
for the main compared methods. Figure depicts the 
initial (squares) and final (circles) configurations of the 
mean vectors of the proposal densities for the standard 
PMC and the PI-MAIS methods, in a specific run and 
different values of tr, A G (3, 5}. In both cases, PI-MAIS 
guarantees a better covering of the modes of 7r(x). 


6.2 Nonlinear banana-shaped target distribution 

Here we consider a b i-dimensional “banana-shaped” tar¬ 
get distribution which is a benchmark function in 
the literature due to its nonlinear nature. Mathemati¬ 
cally, it is expressed as 

^(x„..) OC exp (4 - Bx, - xi)^ - , 

where, we have set B = 10, iji = 4, 772 = 5, and 
773 = 5. The goal is to estimate the expected value 
E[X], where X = [Xi,X 2 ] ^ 7 f(a;i, 0 : 2 ), by applying dif¬ 
ferent Monte Carlo approximations. We approximately 
compute the true value E[X] « [—0.4845, 0]^ using an 
exhaustive deterministic numerical method (with an ex¬ 
tremely thin grid), in order to obtain the mean square 
error (MSE) of the following methods: standard PMC 
[T^ . the Mixture PMC [TT], the AMIS [TS|, PI-MAIS 
and P-MAIS with SMH adaptation. 

We consider Gaussian proposal distributions for all 
the algorithms. The initialization has been performed 
by randomly drawing the parameters of the Gaussians, 
with the mean of the n-th proposal given by fin,o 
U{[—Q,—'S\ X [—4,4]), and its covariance matrix given 
by C„ = i 0; 0 cr^^ 2 ]^' We have considered two 

cases: an isotropic setting where G {1,2,..., 10} 
with k = 1 , 2 , and an anisotropic case with random se¬ 
lection of the parameters where an,k ~ Z^([l, 20 ]), with 


fc = 1,2. Recall that in AMIS and Mixture PMC, the 
covariance matrices are also adapted. 

For each algorithm, we test several combinations of 
parameters, keeping fixed the total number of target 
evaluations, E = 2 - 10®. In the standard PMC method, 
described in Section]^, we consider N G {50,100,10®, 5- 
10®} and T = f (here M = 1). In Mixture PMC, 
we consider different number of component in the mix¬ 
ture proposal pdf A^ G {10, 50,100}, and different sam¬ 
ples per proposal S G {100, 200,10®, 2 • 10®, 5 • 10®} 
at each iteration (here T = ^). In AMIS, we test 
S G {500,10®, 2 • 10®, 5 • 10®, 10^} and T = f (we 
recall N = 1). The range of these values of parame¬ 
ters are chosen, after a preliminary study, in order to 
obtain the best performance from each technique. In 
PI-MAIS an R-MAIS, we set N G {50,100}. For the 
adaptation in PI-MAIS, we also consider Gaussian pdfs 
79 „(xj/i,„_t,A„), covariance matrices A„ = A^l 2 with 
A G {3, 5,10, 20}. In R-MAIS, for the SMH method we 
use a Gaussian pdf with mean [0,0]®^ and covariance 
matrix A = A^l 2 and again A G {3,5,10,20}. We test 
M G {1,9,19} for both, so that T = for Pl- 

MAIS and T = for R-MAIS (see Section 5.4). 

The results are averaged 500 over independent sim¬ 
ulations, for each combination of parameters. Table [I^ 
shows the smallest and highest MSE values obtained in 
the estimation of the expected value of the target, aver¬ 
aged between the two components of E[X], achieved by 
the different methods. The smallest MSEs in each col¬ 
umn (each a) are highlighted in bold-face. Pl-MAIS and 
R-MAIS outperform the other techniques virtually for 
all the values of a. In this example, AMIS also provides 
good results. Figure show a graphical representation 
of the results in Table [T^ with the exception of the last 
column. 

Fig.|4]d isplays the initial (squares) and final (circles) 
configurations of the mean vectors of the proposals for 
the different algorithms, in one specific run. Since in 
Mixture PMC and AMIS the covariance matrices are 
also adapted, we show the shape of some proposals as 
ellipses representing approximately 85% of probability 
mass. For, PMC we also depict a last resampling output 
with triangles, in order to show the loss in diversity. 
Unlike PMC, PI-MAIS ensures a better covering of the 
region of high probability. 


6.3 High dimensional target distribution 

Let us consider again a mixture of isotropic Gaussians 
as target pdf, i.e., 

1 ^ 

7f(x) =-^A/'(x;iZfc,i:fc), xGK®®G 


(40) 
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(a) PMC {N = 100, (7 = 3) (b) PMC {N = 100, o- = 5) (c) PI-MAIS {N = 100, A = 3) (d) PI-MAIS {N = 100, A = 5) 

Fig. 3 Initial (squares) and final (circles) configurations of the mean vectors of the proposal densities for the standard PMC 
and the PI-MAIS methods, in different specific runs. The initial configuration corresponds to Ini. 



Fig. 4 Initial (squares) and final (circles) configurations of the mean vectors of the proposal densities for the banana-shaped 
target distribution, in one specific run for the different methods. The Mixture PMC m and AMIS techniques m also adapt 
the covariance matrices (the ellipses show approximately 85% of the probability mass). 


where v>k = • • ■, , and Sk = XpDx for ^ ^ 

{1,2,3}, with being the x identity matrix. 
We set vij = —5, V 2 j = 6, = 3 for all j = 1,..., D^, 

and Xk = 8 for all k G {1,2,3}. The expected value of 
the target 7 r(x) is then E[Xj\ = | for j = 1,..., D^- In 
order to study the performance of the proposed scheme 
as the dimension of the state space increases, we vary 
the dimension of the state space in Eq. (401 testing 
different values of (with 2 < < 50). 


We consider the problem of approximating via Monte 
Carlo the expected value of the target density, and 
we compare the performance of different methods: (a) 
the standard PMC scheme m, (b) N parallel inde¬ 
pendent MH chains (Par-MH), (c) a standard Sequen¬ 
tial Monte Carlo (SMC) scheme [H] and (d) the pro¬ 
posed PI-MAIS method. We test the algorithms with 
N G {100,500}. All the proposal pdfs involved in the 
experiments are Gaussians, with the same covariance 
matrices for all the techniques. The initial mean vec¬ 
tors in all techniques are selected randomly and inde¬ 
pendently as /j,n,o ~ (^([—6 X 6 ]^“^) for n = 1,..., Ai. 

Again, all the mentioned algorithms have been im¬ 
plemented in such a way that the number of total eval¬ 
uations of the target is E = 2 - 10^. More specifically, in 
PI-MAIS, we use two sets of parameters: with N = 100, 
M = 19, T = 100, and with N = 500, M = 19, T = 20 


in order to fulfill E = {M -b 1)NT = 2 • 10® (see Section 


5.41. The proposal pdf of the upper level of the hier¬ 
archical approach, A„), are Gaussian pdfs 

with covariance matrices A„ = and A = 10. The 

proposal pdfs used in the lower importance sampling 
level, (?n,t(x;|/x„_t, C„) are Gaussian pdfs with covari¬ 
ance matrices C„ = again with tr = 10 (for a 

fair comparison with the other techniques). In PMC, 
Par-MH and SMC we use the same proposals with the 
same covariances and initial parameters. As described 
in App. in PMC the adaptation is carried out by 
resampling steps, in SMC an alternation of resampling 
and MH steps is performed whereas, in Par-MH, N in¬ 
dependent MH chains are carried out. 


The results are averaged over 200 independent sim¬ 
ulations. Fig.|^shows the log-MSE in the estimation of 
^[X] as a function of the dimension of the state- 
space. Fig. I^a) compares the algorithms with N = 100 
proposal pdfs, whereas in Fig. [^b) we have N = 500, 
keeping hxed the number of total evaluations of the 
target E = 2- 10®. We observe, as expected, the perfor¬ 
mance of all the methods degenerate as the dimension 
of the problem, increases, since we maintain fixed 
the computational cost E = 2 ■ 10®. PI-MAIS always 
provides the best results, with the exception for the 
cases corresponding to N = 100 and = 35, 50 where 
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SMC obtains a lower MSE (for N = 100 and = 40, 
they provide virtually the same MSE). 

6.4 Localization problem in a wireless sensor network 

We consider the problem of positioning a target in a 
2-dimensional space using range measurements. This 
problem appears frequently in localization applications 
in wireless sensor networks [Tl|28l|40]. Namely, we con¬ 
sider a random vector X = [Xi,X 2 ]^ to denote the 
target position in the plane The position of the 
target is then a specific realization X = x. The range 
measurements are obtained from 3 sensors located at 
hi = [- 10 , 2 ]T, h 2 = [8,8]T and hg = [-20,-18]^. 
The observation equations are given by 

Yj = alog -k6)j, j = l,...,3, (41) 

where 0j are independent Gaussian variables with iden¬ 
tical pdfs, A/'(i?j; 0, w^), j = 1,2. We also consider a 
prior density over w, i.e., 12 ~ p(w) = A/'(a;; 0, 25)/(w > 
0), where /(w > 0) is 1 if w > 0 and 0 otherwise. The 
parameter yl = a is also unknown and we again con¬ 
sider a Gaussian prior A ^ p{a) = A/’(a;0, 25). More¬ 
over, we also apply Gaussian priors over X, i.e., p(xi) = 
N{xi] 0, 25) with j = 1, 2. Thus, the posterior pdf is 


and the covariance matrices C„ = diag((T^ i,..., cr^ 4)14 
with n = 1,...,7V. The values of the standard de¬ 
viations cr„j- are chosen randomly for each Gaussian 
pdf. Specifically, cr„j ~ U{[1,Q]), j = 1,...,4, where 
we have considered three possible values for Q, i.e., 
Q S {5,10, 30}. For the adaptation process of PI-MAIS, 
we consider also Gaussian proposals with covariance 
matrices A„ = A^l 2 and A S {5,10,70}. We also try 
different non-isotropic diagonal covariance matrices, i.e, 
A„ = diag(A2_i, Af_ 2 ), where \n,j U{[1, 30]). 

For a fair comparison, all the techniques have been 
simulated with sets of parameters that yield the same 
number of target evaluations, fixed to E = 2 • 10 ®. 
In Pl-MAIS, we have chosen parameters N = 100, 
M = {1,19,99}, T = {20,100,1000}. The PMG algo¬ 
rithms has been simulated with N = 100 and T = 2000. 
The MSE of the different estimators (averaged over 
3000 independent runs) are provided in Table and 
the log(MSE) in Figure [^b). Pl-MAIS outperforms al¬ 
ways PMG when (t„j ~ 27([1,5]) and cr„j ~ ^([Ij 10]) 
whereas PMG provides better results for an,j ~ 27([1,30]) 
Therefore, the results show jointly the robustness and 
flexibility of the proposed Pl-MAIS technique. 


7 Conclusions 


TT{xi,X2,a,u) = p{xi,X2,a,uj\y) 

oc £{y\xi,X2, a, uj)p(xi)p{x2)pia)p{uj), 

where y S is the vector of received measurements. 
We simulate d = 30 observations from the model {Dy/3 = 
10 from each of the three sensors) fixing xi = 3, X 2 = 3, 
a = —20 and w = 5. With Dy = 30, the expected 
value of the target {E[Xi] « 2.8749, E[X 2 ] « 3.0266, 
E[A] « 5.2344, E[n] « 20.1582)[3is quite close to the 
true values. 

Our goal is computing the expected value of 

{Xi,X2,A, 12) -- 7f(xi,X2,o,w) 


via Monte Garlo, in order to provide an estimate of the 
position of the target, the parameter a and the standard 
deviation w of the noise in the system. We apply Pl- 
MAIS and three different PMG schemes (see example 
in Section 6.1 for a description), all using N Gaussian 
proposals. We initialize the mean vectors so that they 
are randomly spread within the space of the variables 
of interest, i.e., 


Mn,0 


' A7(/x; 0, dO^L 


), n=l,...,N, 


® These values have been obtained with a deterministic, ex¬ 
pensive and exhaustive numerical integration method, using 
a thin grid. 


In this work, we have introduced a layered (i.e., hierar¬ 
chical) framework for designing adaptive Monte Garlo 
methods. In general terms, we have shown that such a 
hierarchical interpretation lies behind the good perfor¬ 
mance of two well-known algorithms; a random walk 
proposal within an MH scheme and the standard PMG 
method. Furthermore, we have used this approach to 
introduce a novel class of adaptive importance sam¬ 
pling (AIS) schemes. The novel class of AIS algorithms 
employs the determinist mixture (DM) idea [46l [50] in 
order to reduce the variance of the resulting IS estima¬ 
tors. We have extended the use of the DM strategy with 
respect to other algorithms available in the literature, 
providing a more general and flexible framework. From 
an estimation perspective, this framework includes dif¬ 
ferent schemes proposed in literature HH |5H1 as spe¬ 
cial cases, although they differ to an extent in terms 
of the employed adaptation procedure. Our framework 
also contains several other sampling schemes consider¬ 
ing full or partial DM approaches. Finally, we have dis¬ 
cussed several aspects of the trade-offs in terms of the 
computational cost and advantages due to improved ac¬ 
curacy of the resulting estimators. Numerical compar¬ 
isons with different algorithms on benchmark models 
have confirmed the benefit of the layered adaptive sam¬ 
pling approaches. 
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A Consistency of GAMIS estimators 

First of all, we remark that the complete analysis should 
take in account the chosen adaptive procedure since, in gen¬ 
eral, the adaptation uses the information of previous weighted 
samples. However, in this work we consider an adaption pro¬ 
cedure completely independent of the estimation steps, as 
clarified in Sections This simplifies substantially the 

analysis as described in Section|5.1[ 
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The consistency of the global estimators in Eq. ( |29[ ) pro¬ 
vided by GAMIS can be considered when number of samples 
per time step (M x N) and the number of iterations of the 
algorithm (T) grow to infinity. For some exhaustive studies 
of specific cases, see the analysis in |47l I17| and |34| . Here 
we provide some brief arguments for explaining why It and 
Zt obtained by a GAMIS scheme are, in general, consistent. 
Let us assume that q„ys have heavier tails than 7 f(x) oc 7 r(x). 
Note that the global estimator It can be seen as a result of 
a static batch MIS estimator involving L different mixture- 
proposals and J = NMT total number of samples. 

The weights built using ’Pn,t{x) in the denominator of 

the IS ratio are suitable importance weights yielding consis¬ 
tent estimators, as explained in detail in Appendi>(B| Hence, 
for a finite number of iterations T < oo, when M ^ oo (or 
N ^ oo), the consistency can be guaranteed by standard IS 
arguments, since it is well known that Zt Z and It ^ I 
as M —> oo, or A ^ cxD m- 

Furthermore, for T ^ oo and N, M < oo, we have a convex 
combination, given in Eq. ( |31[ l, of conditionally independent 
(consistent but biased) IS estimators 1471 . Indeed, although 
in an adaptive scheme the proposals depend on the previous 
configurations of the population, the samples drawn at each 
iteration are conditionally independent of the previous ones, 
and independent of each other drawn at the same iteration. 
The bias is due to unknown Z (see Eq. Q), and hat Zt is 
used to replace Z. However, Zt Z as T ^ oo, as discussed 
in [la Chapter 14]: hence. It is asymptotically unbiased as 
T —^ oo. 


B Importance sampling with multiple proposals 


Recall that our goal is computing efficiently the Integral I = 
i/;r/W 7r(x)dx where / is any square-integrable function 
(w.r.t. 7 f(x)) of X, and Z = 7 r(x)dx < oo with 7 r(x) > 0 for 
all X S T C . Let us assume that we have two proposal 
pdfs, gi(x) and 92 (x), from which we intend to draw Mi and 
M 2 samples respectively: 




.(Ml) 

■1 


' 91 (x) 


and 


,(i) 

k.2 


.(M 2 ) 

■2 


92 (x). 


There are at least two procedures to build a joint IS estimator: 
the standard multiple importance sampling (MIS) approach 
and the full deterministic mixture (DM-MIS) scheme. 


B.l Standard IS approach 


The simplest approach SZl Chapter 14] is computing the clas¬ 
sical IS weights: 


w 


(i) 

1 



(k) 

»2 = 


92 (x^'"^) 


(42) 


with i = l,...,Mi and k = 1 ,...,M 2 . The IS estimator is 
then built by normalizing them jointly, i.e., computing 


/ Ml Ah 




(43) 


where Stot = ■ Bor J > 2 proposal pdfs 

and xj^4 ■ ■ • ~ 9 j(x), for j = I,..., J, we have 


Wj — 




and 


Iis = ‘ 


2 --/ n = 1 m A = 1 


(m 


oE E 

j = l mj = l 




In this case, Stot = EEi EmEi' 


B.2 Deterministic mixture approach 

An alternative approach is based on the deterministic mixture 
sampling idea Ba ESI Eg. Considering N = 2 proposals 91 , 
92 , and setting 

^={x(^>,...,x("^'),x(^),...,x("=)}, 

with S (n S {1,2} and 1 < mj < Mj), the weights 

are now defined as 


(mj) 


W ■ = 


x(xE4 


/ (Tnj)v 


Ml 


_ Mk _Oo (X 

M1+M2 


92 (x. ) 


(44) 


In this case, the complete proposal is considered to be a mix¬ 
ture of 91 and 92 , weighted according to the number of sam¬ 
ples drawn from each one. Note that, unlike in the standard 
procedure for sampling from a mixture, a deterministic and 
fixed number of samples are drawn from each proposal in the 
DM approach |22| . It can be shown that the set Z of samples 
drawn in this deterministic way is distributed according to 
the mixture q{z) = + ai^^^ 92 (z) [45l Chapter 

9, Section 11]. The DM estimator is finally given by 


2 


(rrij) (r. 
W ■ ^ fix. 

7 J \ n 


j = lmj = l 


(45) 


where Stot = E|=i Emj = i 10 ^^^ s-''® given 

by ( |44[ ). For J > 2 proposal pdfs, the DM estimator can also 
be easily generalized: 


(mi) 

Wi = i 


/ (mi)\ 

-^(xj 


Idm — 


■spj Mj ( (mj)-. 

1 

Y^Afj (^-j) 

2L^n=l ^mj=l j 


and 


E E 

j = l mj = l 




with i = 1 ,..., J, Mtot = M-t + M 2 + ■ ■ ■ + Mj and Stot = 
Ej=i EE = i On the one hand, the DM approach is 

more efficient than the IS method, thus providing a better 
performance in terms of a reduced variance of the correspond¬ 
ing estimator, as shown in the following section. On the other 
hand, it needs to evaluate every proposal Mtot times instead 
of only Mj times (in the standard MIS procedure), and there¬ 
fore is more costly from a computational point of view. How¬ 
ever, this increased computational cost is negligible when the 
proposal is much cheaper to evaluate than the target, as it 
often happens in practical applications. 
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B.3 Convex combination of partial IS estimators 


Regardless the type of weights employed in the IS scheme (ei¬ 
ther as in Eq. ( |42[ ) or as in Eq. ( |44[ |), the resulting estimators 
can be written as convex combination of simpler ones. First of 
all, let us consider again the use of J = 2 proposals, iji and 92 - 
We draw Mj samples from each one, ^ ~ 9i(x), 

with j S {1,2}. The two partial sums of the weights cor¬ 
responding only to the samples drawn from q\ and 92 , are 
given by Si = and S 2 = ■ 'I’be partial 

IS estimators, obtained by considering only one proposal pdf, 
are /i = J]f=i and h = where 

the normalized weights are and re¬ 

spectively. The complete IS estimator, taking into account 
the Ml -|- M 2 samples jointly, is 


= ^^/l+ 

S 1 +S 2 Si-hS 2 


(46) 


This procedure can be easily extended for J > 2 different 
proposal pdfs, obtaining the complete estimator as the convex 
combination of the N partial estimators: 


ftot — 


Ei=i Siij 


(47) 


Ztot — 




y*-' , M, ^ M, 

/—<7 = 1 J 7 = 1 Z—<7 = 1 J 7 = 1 




where , 


^k=l 


Xk) 


AM,) 


Qjiy, I] = El=i«’fV(xf^), Sj = 


Zj = ^ : 


C Hierarchical interpretation of PMC 


The standard Population Monte Carlo (PMC) |12| method 
can be interpreted as using a hierarchical procedure. Although 
it is possible to recognize the two different layers, there are 
some differences w.r.t. the hierarchical procedure in Section 
The first one is that in PMC the generation of /r’s is not 
independent of the previously generated x’s. The second one 
is that the prior is instead /i(/i) = 7f(^^(/x), where is 

an approximation of the measure of 7f(/r) obtained using the 
previously generated samples x’s (In the second level of the 
hierarchical approach). More specifically, a standard PMC 
method m is an adaptive importance sampler using a popu¬ 
lation of proposals qi, ..., qj^. PMC consists of the following 
steps, given an initial set, • • ■, MiV.Oi of mean vectors: 

1. For i = 0,. .. ,T - 1 : 

(a) Draw x„,t ~ g„,t(x|/x„,t, C„), for n = 1,..., A. 

(b) Assign to each sample x„_t the weights, 


'^n,t — 


7r(Xn,t) 

qn,t{_^n,t\fJ‘n,ti Cn) 


(48) 


(c) Resampling: draw N independent samples fj,ri,t+i, n = 
1,..., A, according to the particle approximation 


(M|xi:iV,t) = =gjv- E - Xn,t), 

En=l "'"4 n = l 

(49) 

where we have denoted xi.jv,t = Note 

that each /x„ t_|_i E {xi,t,..., xjv.t}, for all n. 


2. Return all the pairs {x„^t ,n = 1,...,A and t = 
0,...,T-1. 

Fixing an iteration t, the generating procedure used in one 
iteration of the standard PMC method can be cast in the 
hierarchical formulation: 

1. Draw A samples ...,/xjv,t from (/r|xi;jv,t-i)- 

2. Draw ~ 7 n,t(x|/x„,t, C„), for n = 1,... , A. 

Note that pla-ys the role of the prior h in the hierarchical 
scheme above. Differently from the novel proposed scheme, 
the two levels of hierarchical procedure are not independent 
since the pdf (/i|xi;jv,t) depends on the samples drawn 
in the lower level. Furthermore, also varies with t and 

A, whereas in our procedure we consider a fixed prior h. How¬ 
ever, note that is an empirical measure approximation 

of n that improves when A grows. An equivalent formula¬ 
tion of the hierarchical scheme for PMC Is given below. In¬ 
volving a probability of generating a new mean fi given the 
previous ones ■ ..,denoted as 

(MlMl:iV,t-l)- 


C.l Distribution after one resampling step 


Consider the t-th iteration of PMC. Let us define as 


-in — [xi.t,-.., — , X7i-|-iy, . . . , Xjyt] 


the vector containing all the generated samples except for the 
n-th. Let us also denote as S {xi,t ... ,xjv,t}, a generic 

mean vector, i.e. i S (I,..., A} at the iteration i-|- 1, after 
applying one resampling step (i.e., a multinomial sampling 
according to the normalized weights). Hence, the distribution 
of fj. given the previous means is 


E+l (Mi4+l|Ml4. ■ ■ • iMAT.t) 


P 

■ N 


9n,£ , Cti) 

Jx^ 

.n=l 


dxi;jv,t, 


(50) 


where n E^(m|xi :N,t) is given in Eq. ( |49[ |. For simplicity, be¬ 
low we denote 


(j„(x) = g„,t(x|/x„,t, C„), and /r = mg. 


Then, after some straightforward rearrangements, Eq. ( |50[ l 
can be rewritten as 

■ ■ ■ , MJV.t) = 


= E 


jxN-l 




V 


E iV ^ (^Tl ,t ) 
71 = 1 qn(^n,t) 


n 9„(x„,t) 

71=1 


rfm_, 


S(/J, - Xjg). 


Finally, we can write 


r(M) E / ^ 

^ ' Ew-i NZ 


j=i 


qni.Xn.t') 


n=l 


dm- 


(51) 


where Z = A ^ ^ is the estimate of the normalizing 

constant of the target obtained using the classical IS weights. 
The hierarchical formulation of PMC can be rewritten as: 
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1. Draw N samples m,t, • • ■, Miv.t from (/x|/xi;iv.t-i) in 

Eq. @ or 1^. 

2. Draw ~ C„), for n = 1,..., A. 

When A ^ oo, then Z ^ Z 1471 . and thus —>■ 

^7r(/j,) = for alH = 1..., T. Namely, when A grows, the 

hierarchical scheme above tends to have h(/i) = v{fi) as prior 
in the upper level. Figures show three different examples 
of the conditional pdf (obtained via numerical approx¬ 

imation) for a fixed t and different A S {2, 20,1000}. We can 
observe that becomes closer to the target if (depicted 

in solid line) as A grows. 

C.1.1 Differences between PMC and MAIS algorithms 

In the Markov adaptive importance sampling (MAIS) schemes 
described in Section]^ since we are using MCMC methods for 


drawing from h{fj.) = 7r(/x), actually we have also a current 
prior determined for the kernels of 

the considered MCMC algorithms. For instance, in PI-MAIS 
we have 

N 

n=l 

where A„(/j.„_t|/i„_t_i) is the kernel of the n-th chain. Unlike 
in PMC, since we are using ergodic chains with invariant pdf 

ff, we know that (/xi,jv,tlMi:iV,t-i) ^ 11^=1 for 

t —f oo, with a fixed A. Whereas PMC requires to increase A 
for obtaining the same result. 
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Fig. 5 Examples of (/x|/ri;jv,t-i) (approximated numerically and shown with dashed line) and a bimodal target pdf ■7r(x) 
(solid line), fixing an iteration t within a PMC method and for different N: (a) N = 2, (b) N = 20 and (c) N = 1000. 


Algorithm 

O- = 0.5 

(7—1 

(7 = 2 

(7 = 5 

9 

II 

o 

1! 

-sj 

o 

<Tr.,3 ~ W([l. 10]) 



M = 99, T = 20 

1.2760 

0.5219 

0.5930 

0.0214 

0.0139 

0.1815 

0.0107 


A = 5 

M = 19,T = 100 

0.2361 

0.1205 

0.0422 

0.0087 

0.0140 

0.1868 

0.0052 



M = 1, T ^ 1000 

0.1719 

0.0019 

0.0155 

0.0103 

0.0273 

0.3737 

0.0070 



M 99, T ^ 20 

1.0195 

0.1546 

0.2876 

0.0178 

0.0133 

0.1789 

0.0098 


A ^ 10 

M = 19,T ^ 100 

0.1750 

0.0120 

0.0528 

0.0086 

0.0136 

0.1856 

0.0050 

PI-MAIS (TV = 100) 


M = = 1000 

0.1550 

0.0021 

0.0020 

0.0095 

0.0252 

0.3648 

0.0066 



M ^ 99, T ^ 20 

16.9913 

5.5790 

1.4925 

0.0382 

0.0128 

0.1834 

0.0252 


A = 70 

M = 19,T ^ 100 

2.6693 

0.9182 

0.1312 

0.0147 

0.0143 

0.1844 

0.0120 



M 1, T ^ 1000 

0.3014 

0.1042 

0.0136 

0.0115 

0.0267 

0.3697 

0.0093 



M = 99, T = 20 

1.0707 

0.5364 

0.3523 

0.0199 

0.0121 

0.1919 

0.0094 


A,,,, ~tT([l,10]) 

M = 19,T ^ 100 

0.2481 

0.0595 

0.1376 

0.0075 

0.0144 

0.1899 

0.0049 



M ^ 1, T ^ 1000 

0.1046 

0.0037 

0.0045 

0.0099 

0.0274 

0.3563 

0.0065 

Static standard MIS 


= Tn.tfx) 

29.56 

41.95 

64.51 

2.17 

0.0147 

0.1914 

4.55 

Static partial DM-MIS 


= 0t(x) 

29.28 

47.74 

75.22 

0.2424 

0.0124 

0.1789 

0.0651 

AMIS [m 

(best results) 

124.22 

121.21 

100.23 

0.8640 

0.0121 

0.0136 

0.7328 

(worst results) 

125.43 

123.38 

114.82 

16.92 

0.0128 

18.66 

13.49 










PMC 1121 



112.99 

114.11 

47.97 

2.34 

0.0559 

2.41 

0.3017 

PMC WITH PARTIAL DM-MIS 

N = 100, T = 2000 

111.92 

107.58 

26.86 

0.6731 

0.0744 

2.42 

0.0700 

Mixture PMC llfl 



110.17 

113.11 

50.23 

2.75 

0.0521 

2.57 

0.6194 

Parallel Indep. MH chains 

N = 100,T = 2000 

1.6910 

1.7640 

1.8832 

1.4133 

0.2969 

0.5475 

7.3446 


Table 9 (Ex-Sect 6.11 MSE of the estimator of the i?[X] (first component) with the initialization Ini. For all the techniques, 
the total number of evaluations of the target is E = 2 ■ 10®. We recall that, in AMIS |15| . TV = 1 and ^i^t(x) = 5i(x). The last 
row corresponds to the application of TV = 100 (as in PI-MAIS) parallel MH chains where the random walk proposals have 
covariance matrices C = cr^l 2 . The lengths of the chains, as well as of the PMC runs, is T = 2000 for keeping E = 2 ■ 10®. For 


the techniques which adapt the covariance matrices of the proposal pdfs, the values of a have been employed as initial scale 


values for the covariance matrices. For AMIS, we show the best and worst results obtained testing different combinations of 
M and T = The best results, in each column, are highlighted with bold-faces. 
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LU 

(/) 


O) 

O 



0 2 4 6 8 

log(o) 


1 2 3 

Experiment 


(a) Ex-Sect |6.l| 


(b) Ex-Sect |6.4| 


>.l|6.4 1 Summary of the results in Tablein Fig. (a), and Table 13 in Fig. (b): the curve log(MSE) of the 


Fig. 6 (Ex-Sect 

different methods as function of log((T) in Fig. (a) (cr E {0.5,1, 2, 5,10, 70}), and as function of the different experiments in Fig. 
(b). The worst and best results of PI-MAIS are depicted with triangles up and down, respectively. 


Algorithm 


PI-MAIS {N = 100) 


A, 




Static standard MIS 
Static partial DM-MIS 

AMIS rrsi 

PMC Il2l 

PMC WITH partial DM-MIS 
Mixture PMC llll 

Parallel Indep. MH chains 


<7 — 0.5 <7=1 


(7 = 2 


(7 = 5 cr = 10 


(7 = 70 


X = 5 


A = 10 


M = 99,T = 20 
M ^ 19, T ^ 100 
1000 
= 20 


M = 1,T = 
M = 99,r 


M = 19,T = 100 
M = 1,T = 1000 


0.6096 

0.2878 

0.1244 

0.9236 

0.2294 

0.0786 


0.0657 

0.0358 

0.0011 

0.0543 

0.0077 

0.0042 


0.0023 

0.0010 

0.0014 

0.0021 

0.0012 

0.0014 


0.0056 

0.0050 

0.0091 

0.0062 

0.0054 

0.0086 


0.0124 

0.0127 

0.0242 

0.0137 

0.0132 

0.0256 


0.1768 

0.1802 

0.3510 

0.1815 

0.1890 

0.3503 


A = 70 


M = 99, T = 20 
M = 19,T = 

M =1,T = 


= 100 
1000 


5.9889 

1.6670 

0.2579 


0.3662 

0.0871 

0.0134 


0.0082 

0.0045 

0.0024 


0.0089 

0.0080 

0.0097 


0.0140 

0.0139 

0.0258 


0.1841 

0.1971 

0.3543 


M = 99, T = 20 
^ W([l, 10]) M ^ 19,T = 100 
M = 1,T = 1000 


0.5623 

0.2704 

0.0750 


0.0417 

0.0204 

0.0014 


0.0025 

0.0011 

0.0013 


0.0059 

0.0048 

0.0089 


0.0124 

0.0136 

0.0247 


0.1848 

0.1726 

0.3540 




4>r.,t(x) = 0t(x) 


12.00 

10.14 


9.40 

0.9469 


10.26 

0.0139 


7.67 

0.0100 


0.5443 

0.0146 


0.1764 

0.1756 


(best results) 


(worst results) 


113.97 

116.66 


112.70 

115.62 


107.85 

111.83 


44.93 

70.62 


0.7404 

9.43 


0.0141 

18.62 


N = 100, T = 2000 


N = 100,T = 2000 


111.54 110.78 

23.16 7.43 

25.43 I 10.68 
1.3813 I 1.3657 


90.21 

7.56 

6.29 

1.2942 


2.29 

0.6420 

0.6142 

1.0178 


0.0631 

0.0720 

0.0727 

0.3644 


2.42 

2.37 

2.55 

1.0405 


0.0051 

0.0038 

0.0064 

0.0054 

0.0044 

0.0066 

0.0093 

0.0074 

0.0082 

0.0056 

0.0037 

0.0066 

4.37 

0.0106 

31.02 

58.63 

0.3082 

0.0695 

0.1681 

5.3211 


Table 10 (Ex-Sect 


6.11 


MSE of the estimator of the expected value (first component). For all the techniques, the total 
number of evaluations of the target is again E = 2 • 10^. In this case, we have applied the initialization In2, differently from 
TableThe best results, in each column, are highlighted with bold-faces. 
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Algorithm 

cr = 0.5 

a — 1 

17 = 2 

<7 = 5 

<7 = 10 

<T = 70 

~W(11.10]) 





M = 99, T = 20 

0.0388 

0.0120 

0.0070 

0.0002 

0.0001 

0.0016 

0.0001 




A = 5 

M ^ 19, T 100 

0.0031 

0.0013 

0.0004 

0.0001 

0.0001 

0.0017 

0.0001 





M ^ 1, T ^ 1000 

0.0016 

0.0001 

0.0001 

0.0001 

0.0002 

0.0031 

0.0001 





M = 99,T = 20 

0.0217 

0.0046 

0.0040 

0.0001 

0.0001 

0.0016 

0.0002 




o 

II 

M = 19, T = 100 

0.0019 

0.0002 

0.0005 

0.0001 

0.0001 

0.0017 

0.0001 

PI-MAIS (N = 100) 




M ^ 1, T ^ 1000 

0.0016 

0.0001 

0.0001 

8 -10^® 

0.0002 

0.0031 

0.0001 





M ^ 99,T = 20 

6.3732 

0.2713 

0.0226 

0.0003 

0.0001 

0.0016 

0.0002 




II 

-4 

O 

M = 19, T = 100 

0.1082 

0.0114 

0.0019 

0.0001 

0.0001 

0.0017 

0.0001 





M = 1,T - 1000 

0.0038 

0.0009 

0.0001 

0.0001 

0.0002 

0.0033 

0.0001 





M ^ 99,T ^ 20 

0.0350 

0.0101 

0.0043 

0.0001 

0.0001 

0.0015 

0.0001 




A„., ~ W([1.10]) 

M ^ 19, T ^ 100 

0.0029 

0.0007 

0.0010 

8 -10^® 

9 -10“® 

0.0017 

9 -10 ® 





M ^ 1, T ^ 1000 

0.0014 

0.0001 

9•10“® 

0.0001 

0.0002 

0.0036 

0.0001 

Static standard MIS 

= 

= 9..,t(x) 

3.94 -10* 

7.12 -10' 

1.07 -10® 

0.0113 

0.0001 

0.0016 

0.2190 

Static partial DM-MIS 

S>r.,t(x) 

= 0t(x) 

9.5110® 

4.60 10® 

15.34 

0.0016 

0.0001 

0.0016 

0.0005 

AMIS [H] 

(best results) 

15.92 

15.66 

12.81 

0.0069 

8 -10 ® 

0.0001 

0.0002 

(worst results) 

15.97 

15.92 

14.87 

0.4559 

0.0001 

1.62 

0.0084 










PMC ll^ 



33.53 

17.10 

14.42 

0.4249 

0.0015 

0.0016 

0.3542 

PMC with partial DM-MIS 

N = 100, T = 2000 

15.85 

14.31 

1.81 

0.0402 

0.0002 

0.0016 

0.0004 

Mixture PMC 1111 



14.51 

12.09 

3.56 

0.0287 

0.0002 

0.0015 

0.0010 

Table 11 (Ex-Sect 

6.1 

1 MSE of the estimator of the normalizing constant Z with the initialization Ini 

For all the techniques. 


the total number of evaluations of the target is E = 2 ■ 


Algorithm 

cr = 0.5 

<J = 1 

<7 = 2 

<7 = 3 

<7 = 5 

cr = 10 

cr = 70 

o-ij ~W([1,20]) 

PI-MAIS 

Worst 

0.0083 

0.0081 

0.0012 

0.0005 

0.0050 

0.0126 

0.1126 

0.0218 

Best 

0.0025 

0.0001 

0.0002 

0.0001 

0.0002 

0.0003 

0.0361 

0.0004 

I^-MAIS 

Worst 

0.0335 

0.0227 

0.0053 

0.0044 

0.0041 

0.0096 

0.2130 

0.0181 

Best 

0.0082 

0.0025 

0.0013 

0.0008 

0.0001 

0.0002 

0.0265 

0.0003 

PMC [12] 

Worst 

0.0670 

0.0461 

0.0209 

0.0093 

0.0055 

0.0072 

9.4749 

0.1065 

Best 

0.0210 

0.0164 

0.0069 

0.0016 

0.0015 

0.0011 

0.0262 

0.0026 

Mixture PMC [Tl] 

Worst 

3.5772 

0.0113 

0.0044 

0.0066 

0.0174 

0.0267 

0.0913 

0.0103 

Best 

0.0092 

0.0020 

0.0018 

0.0035 

0.0034 

0.0055 

0.0138 

0.0025 

AMIS dH 

Worst 

0.0040 

0.0039 

0.0040 

0.0016 

0.0011 

0.0012 

0.0035 

0.0013 

Best 

0.0023 

0.0028 

0.0023 

0.0009 

0.0003 

0.0004 

0.0023 

0.0007 


Table 12 (Ex-Section- 

obtained with the different techniques for different values of a. The smallest MSE for each a Is bold-faced. 


6.2 ( Bi-dimensional banana-shaped distribution example: Best and worst results in terms of MSE, 


Algorithm 

<Ti.,~W([l,5j) 

<Ti,,~W([l,10J) 

o-ij ~W([1,30]) 



M = 99,T = 20 

0.3819 

0.3508 

0.3626 


A = 5 

M = 19,r = 100 

0.0728 

0.0738 

0.0710 



M = l,r = 1000 

0.0173 

0.0164 

0.0171 



A7 = 99,T = 20 

0.5701 

0.5943 

0.5605 

PI-MAIS 

A = 10 

M = 19,r = 100 

0.1389 

0.1429 

0.1425 



M = i,r = 1000 

0.0401 

0.0408 

0.0393 



A7 = 99,T = 20 

0.3758 

0.3795 

0.4028 


Aij ~W([1,30]) 

M = 19,T = 100 

0.0741 

0.0793 

0.0771 



M = l,r = 1000 

0.0169 

0.0167 

0.0162 

PMC 112J 



0.0642 

0.4345 

0.1533 

PMC WITH PARTIAL DM-MIS 

N = 100, T = 2000 

0.0524 

0.3163 

0.0817 

Mixture PMC [llj 



0.0577 

0.2870 

0.4083 


Table 13 (Ex-Sect 


6.41 


MSE of the estimator of E[{Xi, X 2 , A, H)] using different techniques, keeping constant the total 
number of target evaluation, E = 2 10®. The best results. In each column, are highlighted with bold-faces. 
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(a) Worst results. 



6.21 


Graphical representation of the results in Table [l^ (except for the last column): the curve log(MSE) 


versus log((T) with a g {0.5, Ij 2, 3, 5,10, 70} for the different methods, (a) worst and (b) best results. 




Fig. 8 (Ex-Section- 


6.3 I The 


rve log(MSE) as function of dimension of the problem, S 

{2,3,5,10,12,15,20,25,35,40,50}, for different methods. We test (a) N = 100 and (b) N = 500, keeping fixed the 
same number of evaluation of the target E = 2 ■ 10®. Hence the total number of iterations (of the different algorithms) is 
greater in Fig. 9(a) than in Fig. 9(b). 




































